Abstract

When the data vectors are high-dimensional it is computationally
infeasible to use data analysis or pattern
recognition algorithms which repeatedly compute simi- 
larities or distances in the original data space. It is 
therefore necessary to reduce the dimensionality before, 
for example, clustering the data. If the dimensionality 
is very high, like in the WEBSOM method which orga- 
nizes textual document collections on a Self-Organizing 
Map, then even the commonly used dimensionality re- 
duction methods like the principal component analysis 
may be too costly. It will be demonstrated that the 
document classification accuracy obtained after the di- 
mensionality has been reduced using a random mapping 
method will be almost as good as the original accuracy if 
the final dimensionality is sufficiently large (about 100 
out of 6000). In fact, it can be shown that the inner 
product (similarity) between the mapped vectors follows 
closely the inner product of the original vectors. 

1. Introduction 

There exists a wealth of alternative methods for reduc- 
ing the dimensionality of the data, ranging from dif- 
ferent feature extraction methods to multidimensional 
scaling. The feature extraction methods are often tai- 
lored according to the nature of the data, and there- 
fore they are not generally applicable, for example, in 
all data mining tasks. The multidimensional scaling 
methods, on the other hand, are computationally costly 
already on their own, and if the dimensionality of the 
original data vectors is very high it is infeasible to use 
even linear multidimensional scaling methods (princi- 
pal component analysis) for dimensionality reduction. 

A new rapid dimensionality reduction method is 
needed for situations where it is impossible to use the 
original vectors as such and the existing dimensionality 
reduction methods are too costly. The random mapping
method presented in this paper provides a computationally
feasible method for reducing the dimensionality
of the data so that the mutual similarities between
the data vectors are approximately preserved. 

The motivation for the method dates back to ex- 
periments made by Ritter and Kohonen [10]. They 
organized words based on information of the contexts 
in which they tend to occur. The dimensionality of the 
representations of the contexts was reduced by replac- 
ing each dimension of the original space by a random 
direction in a smaller-dimensional space. 

It may seem surprising that random mapping can 
reduce the dimensionality of the data in a manner that 
preserves enough structure of the original data set to 
be useful. The main goal of this paper is therefore to 
explain why the random mapping method works well 
in high-dimensional spaces, using both analytical and 
empirical evidence. 

2. Random mapping method 

In the (linear) random mapping method the original
data vector, denoted by $n \in \mathbb{R}^N$, is multiplied by a
random matrix $R$. The mapping

$$ x = R n \tag{1} $$

results in a reduced-dimensional vector $x \in \mathbb{R}^d$. The
matrix $R$ consists of random values, and the Euclidean
length of each column has been normalized to unity.

One way of interpreting the random mapping is to
consider what happens to each of the dimensions of the
original space $\mathbb{R}^N$ in the mapping. If the $i$th column of
$R$ is denoted by $r_i$, the random mapping operation (1)
can be expressed as

$$ x = \sum_i n_i r_i . \tag{2} $$

Here the $i$th component of $n$ is denoted by $n_i$. In the
original vector $n$ the components $n_i$ are weights of
orthogonal unit vectors, whereas in the expression (2)
each dimension $i$ of the original data space has been
replaced by a random, non-orthogonal direction $r_i$ in
the reduced-dimensional space.
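
The mapping in (1) and its column-wise reading in (2) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `random_mapping_matrix`, the dimensionalities, the seed, and the test vector are assumptions made for the example, and the normally distributed entries follow the choice made later in Sec. 3.1.

```python
import numpy as np

def random_mapping_matrix(d, N, rng):
    """Return a d x N matrix with i.i.d. normal entries and unit-length columns."""
    R = rng.standard_normal((d, N))
    R /= np.linalg.norm(R, axis=0, keepdims=True)  # normalize each column r_i
    return R

rng = np.random.default_rng(0)   # fixed seed, for reproducibility of the sketch only
N, d = 1000, 100                 # hypothetical original and reduced dimensionalities
R = random_mapping_matrix(d, N, rng)

n = rng.random(N)                # an arbitrary original data vector
x = R @ n                        # equation (1)

# Equation (2): the same vector written as a weighted sum of the columns r_i.
x_sum = sum(n[i] * R[:, i] for i in range(N))
```

Both `x` and `x_sum` hold the same reduced-dimensional vector; the second form only makes explicit that each original dimension contributes along its own random direction.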

3. Properties of the random mapping 

The utility of the random mapping method for cluster- 
ing depends fundamentally on how it affects the mu- 
tual similarities of the data vectors. It is clear that 
the closer the vectors $r_i$ in (2) are to being orthonormal,
the better the similarities of the vectors obtained
by random mapping correspond to the original similar- 
ities. A hint on why even choosing random directions 
for the vectors might be useful has been provided by 
Hecht-Nielsen [3]: there exists a much larger number of 
almost orthogonal than orthogonal directions in a high- 
dimensional space. Therefore, in a high-dimensional 
space even vectors having random directions might be 
sufficiently close to orthogonal to provide an approxi- 
mation of a basis. 

Below the distortions that the random mapping 
method causes on the mutual similarities of data vec- 
tors will be characterized statistically. 

3.1. Transformation of the similarities 

The cosine of the angle between two vectors is a com- 
monly used measure of their similarity. The results in 
this paper will be restricted to vectors with unit length; 
in that case the cosine can be computed as the inner 
product of the vectors. 

The inner product of two vectors, x and y, that 
have been obtained by random mapping of the vectors 
n and m, respectively, can be expressed using (1) as 
follows: 

$$ x^T y = n^T R^T R m . \tag{3} $$

The matrix $R^T R$ can be decomposed into two terms,

$$ R^T R = I + \epsilon , \tag{4} $$

where

$$ \epsilon_{ij} = r_i^T r_j \tag{5} $$

for $i \neq j$, and $\epsilon_{ii} = 0$ for all $i$. The components on
the diagonal of $R^T R$ have thus been collected into the
identity matrix $I$ in (4). They are always equal to unity
since the vectors $r_i$ have been normalized. The entries
off the diagonal have been collected into the matrix $\epsilon$.
If all the entries in $\epsilon$ were equal to zero, i.e., the vectors
$r_i$ and $r_j$ were orthogonal, the matrix $R^T R$ would be
equal to $I$ and the similarities of the documents would
be preserved exactly in the random mapping. In practice
the entries in $\epsilon$ will be small but not equal to zero.

Statistical properties of $\epsilon$. It is possible to analyze
the statistical properties of the entries in $\epsilon$ if we fix
the distribution of the entries in the random mapping
matrix $R$, i.e., the distribution of the components of
the column vectors $r_i$. Assume that the components
are initially chosen to be independent, identically and
normally distributed (with mean zero), and thereafter
the length of each $r_i$ is normalized. The result of
this procedure will be that the directions of the $r_i$ are
distributed uniformly. Then it is evident that

$$ E[\epsilon_{ij}] = 0 \tag{6} $$

for all $i$ and $j$, where $E$ denotes the average over all
random choices for the entries of $R$.

In practice we always use one specific instance of
the matrix $R$, and therefore we need to know more
about the distribution of $\epsilon_{ij}$ to judge the utility of the
random mapping method. It can be proven (cf. Appendix A)
that if the dimensionality $d$ of the reduced-dimensional
space is large, $\epsilon_{ij}$ is approximately normally
distributed. The variance, denoted by $\sigma_\epsilon^2$, can
be approximated by

$$ \sigma_\epsilon^2 \approx 1/d . \tag{7} $$

The distribution of $\epsilon_{ij}$ for several dimensionalities has
been illustrated in Figure 1. The higher-dimensional the
vectors $r_i$ are, the better the matrix $R^T R$ approximates
the identity matrix.
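
The claims in (6) and (7) are easy to probe numerically. The following sketch draws one random matrix, forms $R^T R$, and looks at the mean and variance of its off-diagonal entries; the sizes `N` and `d` and the seed are arbitrary choices for the illustration, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, illustrative only
N, d = 500, 100                  # hypothetical dimensionalities

R = rng.standard_normal((d, N))
R /= np.linalg.norm(R, axis=0, keepdims=True)   # unit-length columns r_i

G = R.T @ R                                      # N x N matrix of inner products r_i^T r_j
eps = G - np.eye(N)                              # the matrix epsilon of equation (4)
off_diagonal = eps[~np.eye(N, dtype=bool)]       # entries with i != j

mean = off_diagonal.mean()                       # compare with (6): zero
var = off_diagonal.var()                         # compare with (7): about 1/d
print(f"mean = {mean:.5f}, variance = {var:.5f}, 1/d = {1 / d:.5f}")
```

Repeating the run with a larger `d` should shrink the variance roughly in proportion to `1/d`, which is the behavior summarized in Figure 1.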

Statistical properties of the mutual similarities. 
Now that we know the distribution of $\epsilon$ it is possible to
investigate more closely how the similarities of the orig- 
inal vectors are transformed in the random mapping. 
More specifically, given a pair n and m of original data 
vectors it is possible to derive the distribution of the 
similarity of the vectors x and y obtained by random 
mapping of n and m, respectively. 

Using equations (3), (4) and (5) the inner product
between the mapped vectors can be expressed as

$$ x^T y = n^T m + \sum_{k \neq l} \epsilon_{kl} n_k m_l . \tag{8} $$

Denote $\delta = \sum_{k \neq l} \epsilon_{kl} n_k m_l$; this expression is the
deviation from the original value of the inner product
produced by the random mapping.

The mean of $\delta$ is zero since the mean of each term
in the sum is zero. It will be shown in Appendix B that
the variance of $\delta$, denoted by $\sigma_\delta^2$, can be expressed as

$$ \sigma_\delta^2 = \Big[ 1 + \Big( \sum_k n_k m_k \Big)^2 - 2 \sum_k n_k^2 m_k^2 \Big] \sigma_\epsilon^2 . \tag{9} $$

Figure 1. Distribution of the inner products between
pairs of random vectors, i.e., the distribution of $\epsilon_{ij}$,
for different dimensionalities $d$. When $d$ increases the
inner products become smaller and the vectors $r_i$ become
more orthogonal. The orthogonality will not be
perfect but the generally small inner products contribute
only small distortions in the similarity computations.
The normal distributions with variance equal
to $1/d$ are plotted in the figure as well; the curves
are only distinguishable from the empirical curves for
$d = 10$, which demonstrates that the distribution of
$\epsilon_{ij}$ approximates the normal distribution already for
fairly small values of $d$.



When the length of the original data vectors $n$ and
$m$ is fixed to unity their inner product is at most 1, so
the bracketed factor in (9) is at most 2 and, based on (7),

$$ \sigma_\delta^2 \le 2 \sigma_\epsilon^2 \approx 2/d . \tag{10} $$

In summary, the distortion of the inner products produced
by the random mapping is zero on the average,
and its variance is at most twice the inverse of the
dimensionality of the reduced space.
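
A small Monte Carlo experiment can be used to compare the variance predicted by (9) with the observed distortion and with the bound (10). The sketch below is an illustration only: the dimensionalities, the number of trials, the seed, and the two test vectors are arbitrary assumptions, and $\sigma_\epsilon^2$ is replaced by its approximation $1/d$ from (7).

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, illustrative only
N, d, trials = 50, 40, 2000      # hypothetical sizes

def unit(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

# Two arbitrary unit-length data vectors.
n = unit(rng.random(N))
m = unit(rng.random(N))

# Distortion delta of the inner product for many independent random matrices R.
deltas = np.empty(trials)
for t in range(trials):
    R = rng.standard_normal((d, N))
    R /= np.linalg.norm(R, axis=0, keepdims=True)
    deltas[t] = (R @ n) @ (R @ m) - n @ m

predicted = (1.0 + (n @ m) ** 2 - 2.0 * np.sum(n**2 * m**2)) / d   # equation (9)
empirical = deltas.var()
bound = 2.0 / d                                                    # equation (10)
print(f"empirical = {empirical:.4f}, predicted = {predicted:.4f}, bound = {bound:.4f}")
```

The empirical variance should land close to the prediction and below the bound, up to sampling noise.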

Consider then a simple but instructive setting where
the data vectors are constrained to have only a certain
number, $L$, of ones and the rest of the components are
zero. If $K$ of the ones occur in the same position in
both of the vectors (i.e., the inner product of the
vectors, normalized to unit length, is $K/L$), then
substituting $n_k, m_k \in \{0, 1/\sqrt{L}\}$ into (9) gives

$$ \sigma_\delta^2 = \Big[ 1 + \Big( \frac{K}{L} \Big)^2 - \frac{2K}{L^2} \Big] \sigma_\epsilon^2 . $$

If the inner product $K/L$ is fixed then the variance of
the error is the smaller the sparser the input is, i.e., the
smaller $L$ is. Random mapping will therefore function
the better the sparser the data is.
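
The effect of sparsity can be illustrated by evaluating the bracketed factor of (9) directly for binary vectors. In this sketch the helper names `variance_factor` and `binary_pair` and the values of `L` are assumptions made for the example; the vectors are scaled to unit length, so each nonzero entry equals $1/\sqrt{L}$ and the factor works out to $1 + (K/L)^2 - 2K/L^2$.

```python
import numpy as np

def variance_factor(n, m):
    """Bracketed factor of equation (9) for unit-length vectors n and m."""
    return 1.0 + (n @ m) ** 2 - 2.0 * np.sum(n**2 * m**2)

def binary_pair(N, L, K):
    """Two unit-length vectors with L equal nonzero entries each, K of them shared."""
    n = np.zeros(N)
    m = np.zeros(N)
    n[:L] = 1.0
    m[L - K:2 * L - K] = 1.0      # overlaps the last K nonzero positions of n
    return n / np.sqrt(L), m / np.sqrt(L)

factors = {}
for L in (2, 10, 100):
    K = L // 2                    # keeps the inner product K/L fixed at 0.5
    n, m = binary_pair(1000, L, K)
    factors[L] = variance_factor(n, m)

print(factors)
```

With the inner product held at 0.5, the factor grows as `L` grows, so the sparsest vectors are distorted the least.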



3.2. Random mapping and the SOM 

Let us next consider how random mapping of the data 
vectors affects the further processing of the data. The 
Self-Organizing Map (SOM) algorithm will serve as an
instructive example case since it will be used in the 
experiments reported in Sec. 4. The conclusions are 
valid for distance-based clustering algorithms as well. 

The SOM algorithm [7] constructs a mapping from 
the input space onto a usually two-dimensional lattice. 
Each lattice position, called a map unit, contains a model
vector, and as a result of the algorithm the model vec- 
tors of neighboring map units gradually learn to rep- 
resent similar input vectors. The mapping becomes 
ordered. The resulting map is an intuitive, abstract 
representation of the data set. The map can be used 
for example in data exploration applications, but in a 
multitude of other applications as well (cf. [7]). 

The SOM algorithm consists of two steps that are 
applied iteratively. First the winning unit whose model 
vector is closest to the current input is selected, and 
thereafter the model vectors of the units that are neigh- 
bors of the winning unit on the map lattice are updated. 

It may be useful to notice that since the random 
mapping operation is linear, small neighborhoods in 
the original space will be mapped onto small neighbor- 
hoods in the smaller-dimensional space. In the SOM 
the model vectors of neighboring units are generally 
close-by, and therefore small neighborhoods in the orig- 
inal space will mostly become mapped onto a single 
map unit or onto a set of neighboring map units. The 
SOM will thus probably not be too sensitive to the dis- 
tortions of the similarities caused by the random map- 
ping. 

Before considering the effects the random mapping 
of the inputs has on the learning of the SOM we must 
consider the concept of the nullspace of the mapping 
operator R. The mapping operation can be considered 
as a "change of basis" to the (non-orthonormal) "basis" 
formed of the rows of R. The rows form a set of ran- 
dom vectors in the original space. The nullspace of the 
operator R is that subspace of the original space that 
becomes mapped to the zero vector. 

Each input vector $n$ that resides in the original data
space can be decomposed into a unique sum of two
orthogonal components $\hat{n}$ and $\tilde{n} = n - \hat{n}$, where $\tilde{n}$
belongs to the nullspace of $R$ and $\hat{n}$ to its complement.
When the input vector $n$ is mapped with the random
mapping operator the result reflects only the part of
$n$ that is orthogonal to the nullspace,

$$ R n = R \hat{n} . \tag{11} $$

The projection thus in effect removes the part of $n$
that resides in the nullspace of $R$.
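
The decomposition can be made concrete with the pseudoinverse, since the product of the pseudoinverse of $R$ with $R$ is the orthogonal projector onto the complement of the nullspace. The sketch below uses `numpy.linalg.pinv`; the variable names `n_hat` and `n_tilde`, the sizes, and the seed are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)   # fixed seed, illustrative only
N, d = 60, 10                    # hypothetical dimensionalities

R = rng.standard_normal((d, N))
R /= np.linalg.norm(R, axis=0, keepdims=True)

n = rng.random(N)                # an arbitrary input vector

P = np.linalg.pinv(R) @ R        # orthogonal projector onto the complement of the nullspace
n_hat = P @ n                    # part of n that survives the mapping
n_tilde = n - n_hat              # part of n that lies in the nullspace of R

print(np.linalg.norm(R @ n_tilde))           # numerically zero
print(np.linalg.norm(R @ n - R @ n_hat))     # numerically zero, equation (11)
```

For a matrix of full rank `d` the nullspace has `N - d` dimensions, here 50 out of 60.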






When the mapped vector $Rn(t)$ is input to the SOM
at time step $t$ the model vectors $m_i$ are updated
according to the rule

$$ m_i(t+1) = m_i(t) + h_{ci}(t) [ R n(t) - m_i(t) ] , \tag{12} $$

where $h_{ci}$ is the so-called neighborhood kernel, a
decreasing function of the distance between the units $i$
and $c$ on the map lattice. Here $c$ is the index of the
unit whose model vector is closest to $Rn(t)$.
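
One iteration of the two SOM steps, winner search followed by the update (12), might look as follows. This is a minimal sketch and not the implementation used in the experiments: the function name `som_step`, the Gaussian form of the neighborhood kernel, and the parameters `alpha` and `sigma` are assumptions, since the text only requires the kernel to decrease with lattice distance.

```python
import numpy as np

def som_step(models, grid, x, alpha=0.1, sigma=1.0):
    """One SOM update for a single input.

    models: (units, d) model vectors; grid: (units, 2) lattice coordinates;
    x: (d,) input vector, e.g. a randomly mapped data vector R n(t).
    """
    # Step 1: the winning unit c has the model vector closest to the input.
    c = int(np.argmin(np.linalg.norm(models - x, axis=1)))
    # Step 2: neighborhood kernel h_ci, here an assumed Gaussian on the lattice.
    lattice_dist2 = np.sum((grid - grid[c]) ** 2, axis=1)
    h = alpha * np.exp(-lattice_dist2 / (2.0 * sigma**2))
    # Equation (12): move every model vector towards x in proportion to h_ci.
    return models + h[:, None] * (x - models), c

# Usage on a hypothetical 3 x 3 map with 4-dimensional inputs.
rng = np.random.default_rng(0)
grid = np.array([[i, j] for i in range(3) for j in range(3)], dtype=float)
models = rng.standard_normal((9, 4))
x = rng.standard_normal(4)
new_models, c = som_step(models, grid, x)
```

In practice `alpha` and `sigma` are decreased during training; that schedule is omitted here.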

The update in (12) occurs in the mapped space, but
actually we are more interested in comparing the results
of the update with the results obtained with a
SOM that would operate on the inputs $n$ in the original
space. It is in fact possible to consider a "virtual
image" of the model vectors $m_i$ in the original space
or, stated more exactly, the virtual images of the model
vectors in the complement of the nullspace of the mapping
operator $R$.

If we denote the pseudoinverse of $R$ by $R^\dagger$ then the
virtual image of the model vector $m_i$ in the original
space is defined to be $R^\dagger m_i$. Let us denote this virtual
image by $\hat{m}_i$; the image is the vector that has the
smallest norm among all the vectors that $R$ maps onto
$m_i$. If we multiply both sides of (12) by $R^\dagger$ we get

$$ \hat{m}_i(t+1) = \hat{m}_i(t) + h_{ci}(t) [ \hat{n}(t) - \hat{m}_i(t) ] , \tag{13} $$

where $\hat{n}(t) = R^\dagger R n(t)$ is the component of the input
in the complement of the nullspace of $R$. The learning
rule then, in effect, corresponds to learning in the
original data space, but in the complement
of the nullspace of $R$.
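
The step from (12) to (13) can be checked numerically for a single model vector. In the sketch below the sizes, the seed, and the kernel value `h` are arbitrary assumptions; `numpy.linalg.pinv` plays the role of the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(4)   # fixed seed, illustrative only
N, d = 60, 10                    # hypothetical dimensionalities

R = rng.standard_normal((d, N))
R /= np.linalg.norm(R, axis=0, keepdims=True)
R_pinv = np.linalg.pinv(R)       # N x d pseudoinverse of R

n = rng.random(N)                # input vector in the original space
m_i = rng.standard_normal(d)     # a model vector in the mapped space
h = 0.3                          # assumed value of the neighborhood kernel h_ci(t)

# Update in the mapped space, equation (12).
m_next = m_i + h * (R @ n - m_i)

# The same update expressed for the virtual images, equation (13).
n_hat = R_pinv @ (R @ n)         # component of n outside the nullspace
m_hat = R_pinv @ m_i             # virtual image of m_i
m_hat_next = m_hat + h * (n_hat - m_hat)

print(np.allclose(R_pinv @ m_next, m_hat_next))
```

Mapping the virtual image back with `R` recovers the model vector, because `R` has full row rank in this setting.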

It may seem disadvantageous to deliberately neglect
the rest of the vector $n$, namely its component in the
nullspace of $R$,
but it will be demonstrated empirically in Sec. 4 that 
even a reduction from a 5781-dimensional space to a 90- 
dimensional one with random mapping produces satis- 
factory results. It may be striking that in this case the 
null-space is 5691-dimensional and only 90 randomly 
chosen dimensions are taken into account. The reason 
for the good results is probably most clearly recogniz- 
able based on equation (10): for 90-dimensional vectors 
the variance in the similarity is smaller than 2.2 % of 
the largest possible similarity. 
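
The figure quoted above follows from (10) with $d = 90$. As a worked check (the standard deviation in the second expression is added here for comparison and is not a number stated in the text):

```latex
\sigma_\delta^2 \le \frac{2}{d} = \frac{2}{90} \approx 0.022 ,
\qquad
\sigma_\delta \le \sqrt{2/90} \approx 0.15 .
```

Since the vectors have unit length, the largest possible similarity is 1, so the variance bound corresponds to roughly 2.2 % of it.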

4. Experiments: mapping of textual documents in the WEBSOM system

4.1. The WEBSOM system 

The WEBSOM [4, 6, 8, 9] is a method for organizing
textual documents onto a two-dimensional map display.
Nearby locations on the display contain similar
documents, which aids in browsing the document collection.
The map can also be used for content-addressable
search, and for filtering interesting documents from an
incoming document stream.

Figure 2. A schematic diagram of the basic building
blocks of the WEBSOM system. In the WEBSOM
the documents are first encoded into numerical vectors
and then mapped onto a two-dimensional display
of the document collection using the SOM algorithm.

The task of encoding the documents in the WEB- 
SOM system (Fig. 2) will be used here as a case study 
of the random mapping method. It should be noted, 
however, that for the sake of clarity the case study does 
not include all of the possible ingredients of the WEB- 
SOM system. 

4.2. Encoding of documents using random mapping

In a simple yet very effective document encoding
method, called the vector space model [11], the doc- 
uments are represented by vectors in a space where 
each dimension corresponds to one word. The value 
of each component is equal to the relative frequency 
of occurrence of the corresponding word in the docu- 
ment. Alternatively, some function of the frequency 
of occurrence and the importance of the word may be 
used. The resulting vectors can be thought of as repre- 
senting the word histograms of the documents. When 
the length of the document vector is normalized the 
direction of the vector will reflect the contents of the 
document. 
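
The vector space model described above can be sketched as follows. The sketch is deliberately simplified and makes several assumptions that are not part of the original system: the function name `encode_documents` is hypothetical, tokens are obtained by lowercasing and splitting on whitespace, every document is assumed to be non-empty, and plain relative frequencies are used with no weighting of word importance.

```python
import numpy as np
from collections import Counter

def encode_documents(docs):
    """Encode documents as unit-length word histograms (one dimension per word)."""
    vocab = sorted({w for doc in docs for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        words = doc.lower().split()
        for w, count in Counter(words).items():
            X[row, index[w]] = count / len(words)    # relative frequency of the word
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # keep only the direction
    return X, vocab

docs = ["the cat sat", "the dog sat", "stock market news"]
X, vocab = encode_documents(docs)
similarities = X @ X.T      # cosines between the documents
```

The first two documents share two of their three words and are therefore similar, whereas the third shares no words with them and has zero similarity to both. The rows of `X` are the high-dimensional vectors to which the random mapping would then be applied.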

It is unfortunately impossible to use the vector space 
model as such for large document collections since the 
dimensionality of the resulting document vectors would 
be very high. There are as many dimensions in the 
vectors as there are words in the vocabulary. The vec- 
tor space model thus seems to be an ideal candidate 
for random mapping. In fact, random mapping of the 
word histograms has already been shown to produce 
promising results in preliminary experiments [5]. 

4.3. Results 

The usefulness of the random mapping method in reducing
the dimensionality of the document vectors was measured
using an index that has been designed to
measure the relative goodness of different document en- 
coding methods in the WEBSOM system. The goal of 
the WEBSOM is to produce a map where each location 
contains a set of similar articles and close-by locations 
contain similar sets. It would be very laborious to as- 
sess the success of the method in the subtle details of 
this task but it is possible to measure how well different 
topic areas are separated on the map of the document 
collection. The document collection used in the experi- 
ments consisted of about 18000 articles from 20 Usenet 
newsgroups, and the groups were considered to rep- 
resent different topic areas. It should be noted that 
although some of the most similar newsgroups were 
grouped together the groups are still highly overlap- 
ping. Thus, the separability of the groups can only 
be used as a relative criterion for comparing different 
document encoding methods and not as an absolute 
measure of the goodness of the WEBSOM method. 

Before constructing the word histograms for the 
documents the rarest and some common words were 
removed. After the removals the dimensionality of the 
document vectors was 5781. In the histograms each 
word was weighted with an entropy-based weight [8]. 
The separability of the newsgroups on the document 
map was measured by teaching a 768-unit SOM using 
the encoded documents as inputs, and labeling each 
map unit according to the group that dominated the 
unit. The separability of the newsgroups was measured
as the total number of documents, summed over the map
units, that belonged to groups other than the one
dominating their unit. All computations
were made using the same text document material;
this corresponds to the usage of the WEBSOM
method in many real situations. It is often more im- 
portant to construct a good map of a certain document 
collection than to be able to generalize the result to new 
documents. 
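
The separability index can be computed from two arrays, the map unit that each document falls into and the group of each document. The following sketch is an interpretation of the description above, with hypothetical names; in particular, reporting the result as the fraction of documents that belong to the dominating group of their unit is an assumption made here to obtain a percentage-like figure.

```python
import numpy as np

def separability(unit_of_doc, group_of_doc, n_groups):
    """Count documents outside the dominating group of their map unit.

    unit_of_doc: index of the map unit of each document;
    group_of_doc: integer group label (0 .. n_groups - 1) of each document.
    Returns (misplaced count, fraction of documents in the dominating group).
    """
    unit_of_doc = np.asarray(unit_of_doc)
    group_of_doc = np.asarray(group_of_doc)
    misplaced = 0
    for u in np.unique(unit_of_doc):
        counts = np.bincount(group_of_doc[unit_of_doc == u], minlength=n_groups)
        misplaced += int(counts.sum() - counts.max())   # documents not in the dominating group
    return misplaced, 1.0 - misplaced / len(group_of_doc)

# Toy example: six documents on three map units, three groups.
misplaced, fraction = separability([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2], n_groups=3)
```

In the toy example only one document, the group-1 document on unit 0, lies outside the dominating group of its unit.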

The separability of the newsgroups as a function of 
the dimensionality d obtained by the random mapping 
method is depicted in Figure 3 together with the results 
obtained with PCA. The PCA is essentially equivalent 
with the latent semantic indexing method [1] that has 
been used to reduce the dimensionality of document 
vectors. The separability obtained with PCA rises very 
rapidly and saturates around d > 50. The random 
mapping requires somewhat larger dimensionalities but 
if d > 90 the results are essentially as good as those 
obtained with PCA, and almost as good as the results 
obtained with the original vectors. Moreover, the
computational complexity of forming the random matrix,
$O(Nd)$, is negligible compared to the computational
complexity of estimating the principal components,
$O(nN^2) + O(N^3)$ [2]. Here $N$ and $d$ are the
dimensionalities before and after the random mapping,
respectively, and $n$ is the number of data vectors.

Figure 3. Separability of topic areas on a WEBSOM
document map as a function of the dimensionality $d$
of the document vectors obtained by random mapping
or PCA. The bars denote the standard deviations
of 7 experiments. The separability obtained
using the original document vectors was 68.0%.
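
To give a feeling for the difference between the complexity terms $O(Nd)$ and $O(nN^2) + O(N^3)$, they can be evaluated for the sizes used in the experiments ($N = 5781$, about 18000 documents) and $d = 90$. The sketch ignores all constant factors, so the numbers are order-of-magnitude operation counts only, not running times.

```python
N = 5781        # dimensionality of the original document vectors
d = 90          # dimensionality after random mapping
n = 18000       # approximate number of documents

random_matrix_cost = N * d                 # forming the random matrix, O(Nd)
pca_cost = n * N**2 + N**3                 # estimating principal components, O(nN^2) + O(N^3)
ratio = pca_cost / random_matrix_cost

print(f"random matrix: {random_matrix_cost:.2e}")
print(f"PCA:           {pca_cost:.2e}")
print(f"ratio:         {ratio:.2e}")
```

Under these assumptions the PCA term is larger by roughly six orders of magnitude.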

5. Discussion 

The random mapping method has been shown to of- 
fer a promising, computationally feasible alternative 
for dimensionality reduction in situations where the 
reduced-dimensional data vectors are used for cluster- 
ing or other similar approaches. Especially if the orig- 
inal dimensionality of the data is very large it is infea- 
sible to use more computationally costly methods like 
the PCA. 

The method has been applied in the WEBSOM doc- 
ument organization system. The dimensionality of the 
original data vectors that describe the documents is 
very high, of the order of thousands, and therefore the 
computations required to construct a Self-Organizing
Map would be infeasible without a rapid dimensional- 
ity reduction method. The random mapping method 
was demonstrated to produce essentially as good re- 
sults as the PCA or the original data vectors if the di- 
mensionality of the mapped vectors is about a hundred 
or more. 

The random mapping method has also produced 
better separability of different topic areas (68% vs. 63%, 
[5]) than an alternative method [4, 6, 8, 9] in which, 
however, the encoding of the documents is faster. 

There exist straightforward neural implementations
of the random mapping method; in this paper the emphasis
has, however, been more on the properties of the
mapping than on the implementation.

Appendix: The proofs

A. Equation (7) 

The distribution of $\epsilon_{ij}$ can be derived fairly easily since
the vectors $r_i$ and $r_j$ in (5) consist of independent
normally distributed values that have been normalized so
that the length of both of the vectors equals unity. The
inner product in (5) is then in fact an estimate of the
correlation coefficient between two independent,
identically and normally distributed random variables. The
normalization of the vectors corresponds to the
normalization of the estimate by square roots of the sums of
squares of the instances of the random variables. It is
an old result, due to Fisher, that
$\frac{1}{2} \ln [ (1 + \epsilon_{ij}) / (1 - \epsilon_{ij}) ]$
is normally distributed with variance equal to $1/(d - 3)$
if $d$ is the number of samples in the estimate. If this
expression is linearized around zero the claim follows for
large $d$.
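
The linearization step can be written out explicitly. Denoting the transformed variable by $z$ (a symbol introduced here only for this illustration), its series expansion around zero is

```latex
z = \frac{1}{2} \ln \frac{1 + \epsilon_{ij}}{1 - \epsilon_{ij}}
  = \epsilon_{ij} + \frac{\epsilon_{ij}^3}{3} + \dots \approx \epsilon_{ij} ,
\qquad\text{so}\qquad
\sigma_\epsilon^2 \approx \operatorname{Var}[z] = \frac{1}{d - 3} \approx \frac{1}{d}
```

for large $d$, where $\epsilon_{ij}$ is small with high probability and the cubic term can be neglected.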

B. Equation (9) 

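Based on the definition of $\delta$,

$$ \sigma_\delta^2 = E\Big[ \Big( \sum_{k \neq l} \epsilon_{kl} n_k m_l \Big) \Big( \sum_{p \neq q} \epsilon_{pq} n_p m_q \Big) \Big]
   = \sum_{k \neq l} \sum_{p \neq q} n_k m_l n_p m_q E[\epsilon_{kl} \epsilon_{pq}] . $$

It is straightforward to verify that $E[\epsilon_{kl} \epsilon_{pq}] = 0$ unless
$k = p$ and $l = q$, or $k = q$ and $l = p$. Hence,

$$ \begin{aligned}
\sigma_\delta^2 &= \sum_{k \neq l} n_k^2 m_l^2 \sigma_\epsilon^2 + \sum_{k \neq l} n_k m_l n_l m_k \sigma_\epsilon^2 \\
 &= \Big[ \sum_k n_k^2 \sum_{l \neq k} m_l^2 + \sum_k n_k m_k \sum_{l \neq k} n_l m_l \Big] \sigma_\epsilon^2 \\
 &= \Big[ 1 - \sum_k n_k^2 m_k^2 + \Big( \sum_k n_k m_k \Big)^2 - \sum_k n_k^2 m_k^2 \Big] \sigma_\epsilon^2 \\
 &= \Big[ 1 + \Big( \sum_k n_k m_k \Big)^2 - 2 \sum_k n_k^2 m_k^2 \Big] \sigma_\epsilon^2 .
\end{aligned} $$

Here we have used the assumption that the data vectors
have been normalized, i.e., $\sum_k n_k^2 = 1$ and $\sum_k m_k^2 = 1$.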